release(0.23.2): merge release branch + post-release smoke fixups into develop #132

Merged
michalharakal merged 11 commits into develop from chore/release-0.23.2
May 5, 2026

@michalharakal

Summary

Brings chore/release-0.23.2 (already tagged as 0.23.2 on commit 6eec93a) plus three post-release smoke-tooling fixes onto develop.

Release content (tagged 0.23.2)

  • edb366c fix(tool-calling): tolerate markdown code fences around Llama 3 JSON
  • 5b34ee9 fix(kllama-cli): route Llama GGUF/SafeTensors back to eager LlamaRuntime — recovers Llama 3 tool-calling functionality at the cost of staying on the legacy path until the DSL gets first-class Q4/Q8 DTypes
  • 5c3b9fa feat(kllama-cli): log prompts, raw responses, and tools list in ToolCallingDemo
  • 74c1416 fix(llama): inject logical 2D shape and dequant token_embd in DSL converter (now Qwen-only)
  • 1e7af50 test(smoke): add Llama-3.2-1B-Instruct entry with tool-calling assertion
  • cea3173 docs(tool-calling): end-to-end Llama 3 setup walkthrough for app integrators
  • 40200da chore(api): refresh public API dumps for 0.23.2
  • 6eec93a release: bump version to 0.23.2

Post-release smoke fixups (not in the 0.23.2 tag, no published-artifact change)

  • 412d0b6 fix(kbert-cli): apply application plugin so :run task is wired
  • 7abe110 fix(smoke): tolerate runners that don't emit tok/s (embedding models)
  • 70e936d test(smoke): add MongoDB/mdbr-leaf-ir embedding entry

These three only affect tests/smoke/ and kbert-cli's build script — they don't change anything published to Maven Central.

Test plan

  • :llm-agent:jvmTest, :llm-inference:llama:jvmTest, :llm-runtime:kllama:jvmTest — green
  • Parser regression tests for fenced JSON (3 new cases) — green
  • apiCheck — green (dumps refreshed via apiDump)
  • Smoke: Llama-3.2-1B-Instruct chat (0.37 t/s) + tool calling ([Tool Call] calculator → 4.0)
  • Smoke: MongoDB/mdbr-leaf-ir embedding (cosine 0.78 between two MongoDB-related sentences, 384-dim, ~290 ms/encode)
  • CI on this PR

Known followups (called out in the 0.23.2 tag annotation)

  • Recover the previous ~2 t/s baseline on Llama Q8 — needs first-class Q4/Q8 DTypes in the DSL or per-call SIMD dispatch in ops.matmul.
  • Bisect the residual perf gap on the eager path (upstream skainet pinned at 0.23.1, so it's not an upstream backend bump).

🤖 Generated with Claude Code

michalharakal and others added 11 commits May 5, 2026 08:58
Llama 3.2 1B Instruct sometimes wraps its tool-call JSON in a triple-
backtick fence (```...``` or ```json...```) even though the system
prompt instructs bare JSON. Llama31ToolCallParserStrategy required
candidate.startsWith("{"), so the fenced form silently parsed as
"no tool call" and the agent loop returned the raw JSON to the user
instead of executing the tool.

Add a stripCodeFence step that peels one layer of opening/closing
fence before the existing python-tag strip + balanced-brace extraction.
Both parse() and containsToolCall() use it so dispatch and detection
stay in lockstep.

Pinned by three new ToolCallParserTest cases (plain ```...```, ```json
tagged fence, and containsToolCall on fenced input).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
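
The fence-peeling step described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `stripCodeFence` here is a hypothetical reimplementation, `looksLikeToolCall` is an invented stand-in for `containsToolCall()`, and the real `Llama31ToolCallParserStrategy` also performs the python-tag strip and balanced-brace extraction, which are omitted.

```kotlin
// Hypothetical sketch: the real helper in Llama31ToolCallParserStrategy may differ.
private val FENCE = "`".repeat(3)

// Peels at most one layer of opening/closing fence and an optional language tag.
fun stripCodeFence(raw: String): String {
    val text = raw.trim()
    if (text.length < 2 * FENCE.length || !text.startsWith(FENCE) || !text.endsWith(FENCE)) {
        return text
    }
    // Drop the opening and closing fence markers.
    var inner = text.substring(FENCE.length, text.length - FENCE.length)
    // Drop an optional language tag such as "json" on the opening fence line.
    val firstNewline = inner.indexOf('\n')
    if (firstNewline >= 0) {
        val tag = inner.substring(0, firstNewline).trim()
        if (tag.all { it.isLetterOrDigit() }) inner = inner.substring(firstNewline + 1)
    }
    return inner.trim()
}

// Detection reuses the same normalization as parsing, so the two cannot disagree.
fun looksLikeToolCall(raw: String): Boolean = stripCodeFence(raw).startsWith("{")
```

Routing both detection and parsing through one normalization function is what keeps the two in lockstep: a fenced response can never be detected as a tool call and then fail to parse for fence-related reasons, or the reverse.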
Reverts the Llama branch of d519eb2 ("swap Llama GGUF + SafeTensors to
DSL path") in the JVM kllama-cli. The DSL path (DecoderGgufWeightLoader
→ DecoderGgufMemSegConverter → LlamaNetworkLoader.fromWeights →
OptimizedLLMRuntime DIRECT) is functionally correct but pays a per-
linearProject ops.transpose tax on packed Q4/Q8 weights — the kernel
still has to sniff the marker class on every call because the DSL
doesn't have first-class Q4/Q8 DTypes yet. Measured 0.24 t/s on
Llama-3.2-1B-Instruct-Q8 vs the legacy LlamaRuntime path that hits
the SIMD quant matmul kernel directly.

Restore the pre-d519eb2 wiring: LlamaIngestion(NATIVE_OPTIMIZED,
allowQuantized=true) → MemSegWeightConverter.convert → CpuAttentionBackend
→ LlamaRuntime<FP32> (with @Suppress("DEPRECATION")). Fold the BIN
branch into the same `else` block since it was already on LlamaRuntime
anyway. Re-add the LlamaIngestion + LlamaLoadConfig + MemSegWeightConverter
imports; drop DecoderSafeTensorsLoader + LlamaNetworkLoader from imports.

Qwen GGUF stays on the DSL path (unchanged) — its Q8 perf is acceptable
on smaller batches and the parity test pins it.

Recovers tool calling functionality on Llama 3.2: the CLI smoke test
now emits [Tool Call] calculator(...) → [Tool Result] 4.0 cleanly.

Followup: either give the DSL first-class Q4/Q8 DTypes so linearProject
can dispatch directly, or push the SIMD kernel selection deeper into
ops.matmul so the per-call transpose disappears. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
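
The routing that results from this revert can be summarized as a small decision function. The enums and `selectRuntimePath` below are hypothetical names invented for illustration; the actual kllama-cli wiring uses the loader and runtime classes named in the commit message, and combinations the message does not mention are deliberately left undecided.

```kotlin
// Hypothetical types for illustration only; not part of the kllama-cli API.
enum class ModelFamily { LLAMA, QWEN }
enum class WeightFormat { GGUF, SAFETENSORS, BIN }
enum class RuntimePath { EAGER_LLAMA_RUNTIME, DSL_OPTIMIZED_RUNTIME }

// Returns the runtime path the commit message describes, or null for
// combinations the message does not cover.
fun selectRuntimePath(family: ModelFamily, format: WeightFormat): RuntimePath? =
    when {
        // Llama GGUF, SafeTensors, and BIN all share the legacy eager branch.
        family == ModelFamily.LLAMA -> RuntimePath.EAGER_LLAMA_RUNTIME
        // Qwen GGUF stays on the DSL path, unchanged by this commit.
        family == ModelFamily.QWEN && format == WeightFormat.GGUF -> RuntimePath.DSL_OPTIMIZED_RUNTIME
        else -> null
    }
```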
…allingDemo

The smoke test for tool calling previously printed only [Tool Call] and
[Tool Result] markers, which made it hard to debug a missed call (was the
prompt malformed? did the model emit something the parser couldn't see?).

ToolCallingDemo.runSingleShot now also prints, before generation:
  - [Tools] block with each tool's name, description, and JSON schema
  - [Prompt → Round 1] with the full chat-template-rendered prompt the
    model will actually receive (system + tools + user)

…and during the agent loop the listener prints:
  - [Raw Assistant → Round N] with the model's exact output per round
  - [Tool Call Invalid] when a tool call fails JSON-Schema validation

…and after the loop completes:
  - [Final Conversation] with every message in the accumulated history,
    so post-round-1 prompts can be reconstructed by the reader.

No behavior change to the agent loop itself; this is pure observability
on the demo path. The smoke-test.sh greps still match [Tool Call] /
[Tool Result] so PASS/FAIL accounting is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verter

DecoderGgufMemSegConverter wrapped Q4_0/Q8_0 GGUF tensors in
Q4MemorySegmentTensorData / Q8MemorySegmentTensorData using the loader's
intermediate Int8 byte-count Shape (1D = bytes.size). The DSL's
linearProject calls ops.transpose(weight) before matmul, and transpose
needs the logical 2D shape [out, in] from metadata to dispatch the
quant-aware kernel. Previously the kernel either rejected the 1D shape
or silently fell back to a generic path.

Compute the logical [out, in] shape per tensor name (attn_q/k/v/output,
ffn_gate/up/down, token_embd, output) from LlamaModelMetadata
(embeddingLength, headCount, kvHeadCount, feedForwardLength, vocabSize)
and pass it to the Q4/Q8 MemSeg wrappers. K-quants get the same logical
shape on dequant.

Special-case token_embd.weight: the Embedding layer consumes it via
gather (row indexing), not matmul. Packed Q4/Q8 bytes can't be gathered
as floats, and the loader's Int8 byte-count shape is rejected by gather
even after wrapping. Always dequantize token_embd to FP32 with the
[vocab, dim] shape regardless of quant type. (Tied output.weight stays
Q-packed because it's used by the LM head matmul, not gather.)

After the JVM kllama-cli Llama branch reverted to LlamaRuntime in the
previous commit, this converter is exercised only by the Qwen GGUF
branch and the Java facade KLlamaJava.loadGGUF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
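
A rough sketch of the per-tensor shape computation described above, assuming the standard Llama layout in which the head dimension is `embeddingLength / headCount` and K/V projections are sized by `kvHeadCount`. `ShapeMetadata` is a hypothetical stand-in for the `LlamaModelMetadata` fields named in the commit, and `logicalShape` is an invented name; the converter's real code may be structured differently.

```kotlin
// Hypothetical stand-in for the fields the commit names on LlamaModelMetadata.
data class ShapeMetadata(
    val embeddingLength: Int,
    val headCount: Int,
    val kvHeadCount: Int,
    val feedForwardLength: Int,
    val vocabSize: Int,
)

// Returns the logical [out, in] shape for a GGUF tensor name,
// or null if the name is not one of the recognized kinds.
fun logicalShape(tensorName: String, m: ShapeMetadata): List<Int>? {
    // Assumption: heads evenly split the embedding dimension.
    val headDim = m.embeddingLength / m.headCount
    val kvDim = m.kvHeadCount * headDim
    // "blk.0.attn_output.weight" -> "attn_output"; "output.weight" -> "output"
    val kind = tensorName.removeSuffix(".weight").substringAfterLast('.')
    return when (kind) {
        "attn_q", "attn_output" -> listOf(m.embeddingLength, m.embeddingLength)
        "attn_k", "attn_v" -> listOf(kvDim, m.embeddingLength)
        "ffn_gate", "ffn_up" -> listOf(m.feedForwardLength, m.embeddingLength)
        "ffn_down" -> listOf(m.embeddingLength, m.feedForwardLength)
        "token_embd", "output" -> listOf(m.vocabSize, m.embeddingLength)
        else -> null
    }
}
```

Matching on the last name segment rather than a substring keeps `attn_output` from being mistaken for the LM head's `output` tensor.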
Pin the Llama 3.2 chat + tool-calling path in the smoke suite so the
JSON-mode parser fix and the eager-LlamaRuntime revert can't silently
regress. Tool-calling prompt is "What is 2 + 2?" with 256 max steps;
the smoke runner asserts a [Tool Call] line is emitted.

Path resolves under MODELS_ROOT or the user's ~/.cache/standapp/models;
absent locally it reports FAIL (model path not found) cleanly without
masking other failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…integrators

Rewrites llama3-tool-calling.adoc into a getting-started for embedding
Llama 3 / 3.1 / 3.2 tool calling into a consumer Kotlin app, on top of
the existing format-internals reference.

New sections:
  - Quick start: actual CLI output a user sees ([Tools], [Prompt → Round 1],
    [Raw Assistant], [Tool Call], [Tool Result]) so the goal is concrete.
  - Use it from your own Kotlin app: 5 numbered steps —
      1. Add sk.ainet.transformers:llm-runtime-kllama + llm-agent (0.23.2)
         + --enable-preview --add-modules jdk.incubator.vector
      2. KLlamaJava.loadGGUF(Path.of(...)) — runtime + tokenizer in one call
      3. Define a tool — full WeatherTool example with ToolDefinition + JSON
         schema + execute()
      4. ChatSession(runtime, tokenizer, ModelMetadata(family="llama")) →
         ToolRegistry → createAgentLoop → runWithEncoder
      5. AgentListener snippet for prompt/answer/tool-call/tool-result logs;
         how to render the prompt yourself via chat.chatTemplate.apply(...)
  - Verify it's working: exact callback sequence, what to file if it breaks.
  - NOTE: Llama31ToolCallParserStrategy peels markdown code fences (this
    release's parser fix); also added to the "Parser accepts" bullet list.

The pre-existing format reference (Llama3ToolFormat.JSON / FUNCTION_TAG,
picking format programmatically, why two formats exist, model-size caveat,
related files) is preserved verbatim below the new walkthrough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
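
Step 1 of the walkthrough might look like this in a consumer app's `build.gradle.kts`. This is a sketch under assumptions: the commit message gives the group id only for `llm-runtime-kllama`, so the `llm-agent` coordinate below assumes the same group, and applying the JVM flags through `JavaExec` tasks is one possible way to pass them.

```kotlin
// build.gradle.kts of a consumer app (sketch; plugins and other settings omitted)
dependencies {
    implementation("sk.ainet.transformers:llm-runtime-kllama:0.23.2")
    // Assumption: llm-agent is published under the same group id.
    implementation("sk.ainet.transformers:llm-agent:0.23.2")
}

// The Vector API is an incubator module, so JVM launches need these flags.
tasks.withType<JavaExec>().configureEach {
    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector")
}
```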
Regenerates the binary-compatibility-validator dumps via apiDump to
match the actual public surface of every module. Catches up the dumps
to the accumulated changes on the chore/release-0.23.2 branch:

  - llm-runtime/kllama: drops GpuAttentionBackend, GpuTensorBridge,
    LlamaIngestionBlocking; adds JavaTools.definition; KLlamaSession
    constructor takes InferenceRuntime instead of LlamaRuntime.
  - llm-providers: SkaiNetChatModel constructor takes a Set parameter
    where there used to be an Int (additional generation knobs).
  - llm-inference/llama: CpuAttentionBackend constructor takes a
    RopeType (Llama vs Qwen RoPE selection).
  - llm-inference/voxtral: new module, first API dump.
  - llm-agent, llm-core, apertus, bert, gemma, qwen, kgemma: assorted
    additions/cleanups from the DSL swaps and shared decoder body
    refactors merged into this release branch.

No new source-level public API changes in this commit — purely the
catch-up dump from prior commits + the new voxtral module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final commit on the chore/release-0.23.2 branch. Highlights since 0.23.1:

  - feat(kllama-cli/native): swap CLI to DSL path for Llama + Qwen
    (#125, #129, #130) on native and wasm targets.
  - cleanup(gpu): delete placeholder GPU attention/tensor stubs that
    always fell back to CPU; rename native benchmark scenario to
    native-cpu-throughput (#131).
  - feat(llm-core): wire SentencePiece decorator + GGUF tokenizer route
    through upstream sk.ainet.io.tokenizer; fix Qwen / GPT-2 BPE GGUFs
    (#52, #124).
  - fix(qwen): NEOX (SPLIT_HALF) RoPE pairing for Qwen3 GGUFs.
  - fix(transformer): thread metadata RMSNorm eps through QK-norm.
  - cleanup: delete :llm-runtime:kqwen module; remove
    LlamaIngestionBlocking.kt.
  - feat(kllama-java): swap KLlamaJava facade to DSL path.
  - fix(tool-calling): tolerate markdown code fences around Llama 3 JSON
    tool calls.
  - fix(kllama-cli): route Llama GGUF/SafeTensors back to eager
    LlamaRuntime for now — the DSL Q4/Q8 path is functionally correct
    but needs first-class Q4/Q8 DTypes to match the SIMD perf of the
    legacy path. Tracked as a followup.
  - docs: end-to-end Llama 3 tool-calling walkthrough for app integrators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without the Gradle application plugin, the project only exposed shadowJar
+ shadowJarDemo and Gradle reported "task 'run' not found". The smoke
test (tests/smoke/smoke-test.sh) dispatches every kbert entry via
:llm-apps:kbert-cli:run, so the kbert smoke leg was a no-op masked by
"model path not found" on the absent all-MiniLM-L6-v2.gguf fixture.

Apply the application plugin and pin mainClass to
sk.ainet.apps.bert.cli.MainKt — same pattern as :llm-apps:kllama-cli.
The existing JavaExec configuration already adds --enable-preview +
jdk.incubator.vector, so :run inherits the right JVM args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
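
In Gradle Kotlin DSL, the change described above amounts to something like the following fragment. It is a sketch only: the module's other plugins, its shadowJar setup, and the existing JavaExec configuration are omitted.

```kotlin
// build.gradle.kts for :llm-apps:kbert-cli (sketch; other configuration omitted)
plugins {
    application
}

application {
    // Entry point named in the commit message.
    mainClass.set("sk.ainet.apps.bert.cli.MainKt")
}
```

The `application` plugin is what registers the `run` task, which is of type `JavaExec`; that is why a JavaExec-wide JVM-args configuration applies to it automatically.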
Under `set -euo pipefail`, a no-match `grep -oE 'tok/s: [0-9.]+'` exits
with status 1, which aborted the script right after a successful kbert
run because BertRuntime doesn't print throughput. Visible symptom:
the script stopped mid-test without an error message, a PASS/FAIL line,
or the summary table.

Append `|| true` to the substitution and the secondary `sed | grep |
sed` block so a missing tok/s falls back to "?" (the existing default)
instead of aborting the whole run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
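
The actual fix lives in the shell script, but the fallback logic it restores can be expressed in Kotlin for clarity. `extractTokPerSec` is a hypothetical helper, not part of the smoke tooling; it only mirrors the script's pattern and its "?" default.

```kotlin
// Hypothetical Kotlin analog of the script's extraction: a missing tok/s
// figure is an expected case (embedding runners), not an error.
fun extractTokPerSec(runnerOutput: String): String {
    val match = Regex("""tok/s: ([0-9.]+)""").find(runnerOutput)
    // Fall back to "?" when the runner prints no throughput line.
    return match?.groupValues?.get(1) ?: "?"
}
```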
Pins the BERT embedding path against an actual model that ships with
config.json + vocab.txt + tokenizer.json + 2_Dense projection — the
same layout BertNumericalAccuracyTest validates. Path resolves under
the standard HuggingFace cache (~/.cache/huggingface/hub/models--MongoDB--mdbr-leaf-ir/
snapshots/<rev>) so any developer who has snapshot-downloaded the
revision via huggingface_hub can run it without extra setup.

Prompt + doc are intentionally on-topic ("MongoDB is a NoSQL database"
vs "MongoDB stores data in BSON documents") so the cosine similarity
shows up high (~0.78 on local validation) — a sanity check that the
embedding pipeline isn't producing garbage.

Snapshot revision is pinned to the SHA returned by the December 2024
HF index. To refresh:
  uv run --with huggingface_hub python -c \
    "from huggingface_hub import snapshot_download; \
     print(snapshot_download(repo_id='MongoDB/mdbr-leaf-ir'))"

The pre-existing all-MiniLM-L6-v2 entry is left in place — it'll
report FAIL (model path not found) on machines that don't have it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
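
For reference, the cosine-similarity sanity check can be sketched as below. `cosineSimilarity` is a hypothetical helper written for illustration; the smoke test relies on whatever the kbert CLI reports, not on this code.

```kotlin
import kotlin.math.sqrt

// Cosine similarity of two equal-length embedding vectors.
// Returns 0.0 if either vector is all zeros, to avoid dividing by zero.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        val x = a[i].toDouble()
        val y = b[i].toDouble()
        dot += x * y
        normA += x * x
        normB += y * y
    }
    if (normA == 0.0 || normB == 0.0) return 0.0
    return dot / (sqrt(normA) * sqrt(normB))
}
```

Two on-topic sentences should score well above what unrelated text would, so a value near zero or a NaN points to a broken tokenizer, pooling, or projection step rather than a modeling issue.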
@michalharakal michalharakal merged commit 0fcd733 into develop May 5, 2026
4 checks passed
@michalharakal michalharakal deleted the chore/release-0.23.2 branch May 5, 2026 08:13